In the load_data chunk we have loaded all the data we will need for the analysis. We have 3 data.frames:
df.counts.raw: contains the counts for each gene an sample.df.submission: contains information about the RNA-extraction samples.df.design: maps sample_ids to the conditions of the experiment.In the df.counts.raw data.frame, samples are identified by the sequence_id column:
| sequence_id |
|---|
| S01 |
| S02 |
| S03 |
| S04 |
| S05 |
| S06 |
In order to map sequence_id to sample_id we to combine the information in df.submission and df.design:
| isolation_date | sample_id | sequence_id | volume_ul | conc_ng_ul | A260 | ratio_260_230 | ratio_260_280 | yield_ug |
|---|---|---|---|---|---|---|---|---|
| 2020-01-19 | 191226_PC_2_1 | S01 | 59 | 63.202 | 1.5800 | 1.985 | 2.061 | 3.728918 |
| 2020-01-19 | 191226_PC_2_3 | S02 | 59 | 118.006 | 2.9502 | 1.454 | 2.224 | 6.962354 |
| 2020-01-19 | 191226_PC_2_4 | S03 | 59 | 29.625 | 0.7400 | 1.305 | 2.095 | 1.747875 |
| 2020-01-19 | 200116_PC_3_1 | S04 | 59 | 94.049 | 2.3512 | 2.197 | 2.233 | 5.548891 |
| 2020-01-19 | 200116_PC_3_3 | S05 | 59 | 169.249 | 4.2312 | 2.255 | 2.153 | 9.985691 |
| 2020-01-19 | 200116_PC_3_4 | S06 | 59 | 73.703 | 1.8426 | 2.185 | 2.108 | 4.348477 |
| multicultivator | channel | strain | sample_time | sample_id | phosphate_mM | purpose | dataset |
|---|---|---|---|---|---|---|---|
| 0 | 1 | WT | NA | 191104_PC_1_1 | 0.2300 | control | pre-coure |
| 0 | 1 | WT | NA | 191226_PC_2_1 | 0.0115 | treatment | pre-coure |
| 0 | 1 | WT | NA | 200116_PC_3_1 | 0.2300 | control | pre-coure |
| 0 | 2 | WT | NA | 191104_PC_1_2 | 0.2300 | control | pre-coure |
| 0 | 3 | WT | NA | 191104_PC_1_3 | 0.2300 | control | pre-coure |
| 0 | 3 | WT | NA | 191226_PC_2_3 | 0.0115 | treatment | pre-coure |
Create a data.frame called df.transcriptomics.design combining the required information:
| sample_id | sequence_id | multicultivator | channel | strain | sample_time | phosphate_mM | purpose | dataset |
|---|---|---|---|---|---|---|---|---|
| 191226_PC_2_1 | S01 | 0 | 1 | WT | NA | 0.0115 | treatment | pre-coure |
| 191226_PC_2_3 | S02 | 0 | 3 | WT | NA | 0.0115 | treatment | pre-coure |
| 191226_PC_2_4 | S03 | 0 | 4 | WT | NA | 0.0115 | treatment | pre-coure |
| 200116_PC_3_1 | S04 | 0 | 1 | WT | NA | 0.2300 | control | pre-coure |
| 200116_PC_3_3 | S05 | 0 | 3 | WT | NA | 0.2300 | control | pre-coure |
| 200116_PC_3_4 | S06 | 0 | 4 | WT | NA | 0.2300 | control | pre-coure |
As mentioned above, the df.counts.raw data.frame contains the counts for each gene an sample. In addition, it contains some rows indicating the number of reads that were not aligned, ambiguous and for which no feature was found. This information is stored in the last rows of each counts file.
| gene | counts | sequence_id |
|---|---|---|
| __no_feature | 666901 | S01 |
| __ambiguous | 192804 | S01 |
| __too_low_aQual | 0 | S01 |
| __not_aligned | 256021 | S01 |
| __alignment_not_unique | 0 | S01 |
| __no_feature | 1167785 | S02 |
| __ambiguous | 382194 | S02 |
| __too_low_aQual | 0 | S02 |
| __not_aligned | 352425 | S02 |
| __alignment_not_unique | 0 | S02 |
| __no_feature | 819476 | S03 |
| __ambiguous | 270242 | S03 |
| __too_low_aQual | 0 | S03 |
| __not_aligned | 212572 | S03 |
| __alignment_not_unique | 0 | S03 |
| __no_feature | 915897 | S04 |
| __ambiguous | 294485 | S04 |
| __too_low_aQual | 0 | S04 |
| __not_aligned | 287451 | S04 |
| __alignment_not_unique | 0 | S04 |
| __no_feature | 829787 | S05 |
| __ambiguous | 286712 | S05 |
| __too_low_aQual | 0 | S05 |
| __not_aligned | 372251 | S05 |
| __alignment_not_unique | 0 | S05 |
| __no_feature | 769118 | S06 |
| __ambiguous | 255793 | S06 |
| __too_low_aQual | 0 | S06 |
| __not_aligned | 285180 | S06 |
| __alignment_not_unique | 0 | S06 |
We want to have information about the alignment process. For each sample, make a plot showing the percentage of reads that were not aligned, ambiguous and for which no feature was found.
For further analyses, we will need a data.frame that do NOT contain these rows. Create such data.frame and name it df.counts.
The goal of this exercise is to perform a quality control of the count data. In order to do this you will visualize the data using different types of plots.
The total number of counts for each sample might be different due to multiple factors.
Create a barplot showing the total number of counts for each sample.
How many reads do you (approximately) have from each sample? Are the samples different? Do you see outliers? Can you conclude that the values are consistent, or is a normalization step required?
Out of all the RNA present in a prokaryotic cell, roughly 85% is ribosomal (r)RNAs. Thus when sequencing RNA 85% of the reads will be on non-coding rRNA, while we are interested in mRNA. To improve sensitivity and dynamic range on the mRNA levels we remove the rRNA, called ribosomal RNA depletion. We should check if this procedure was effective.
Create a barplot showing the percentage of counts for each rRNA to the total number of counts.
What percentage of the reads is mapped to rRNA? Has each sample roughly the same amount of rRNA reads? What would be the effect of deviations in rRNA level when we compare the samples? Should we correct the data somehow for this?
Researchers often filter out the genes that are measured with a very little amount of reads. These measurements are then considered unreliable.
Create a new data.frame called df.counts.f in which you filter out all rRNA genes and all the genes that are measured with less than 24 reads in all samples. In addition, create a new column named counts_log2 that contains the counts in log space. We will need this column in the coming steps.
Hint: to avoid getting infinite values for genes with 0 counts, add 1 to all counts before you calculate the log. This is usually referred as pseudo-counts.
Did the filtering step have an impact on your data? Re-create the figure made in Excercise 2.1.1
Another way to check if data needs to be normalized is to visualize the distribution of the counts. This is usually done in log space.
Make a boxplot to visualize the distribution in gene expression level, for every sample.
Are the samples similar or different? Do you see outliers? Can you conclude that the values are consistent, or is a normalization step required?
Make a density plot to visualize the distribution in gene expression level, for every sample.
Are the samples similar or different? Do you see outliers? Can you conclude that the values are consistent, or is a normalization step required?
An important aspect of quality control is comparing the samples. A direct way of comparing the samples is to plot the gene expression levels of one sample against the gene expression levels of another sample.
In order to be able to plot the log2 count values of two samples against each other, we need to transform the df.counts.f to wide format, such that we have one row per gene and one sample per column. Create a new data.frame, df.counts.f.wide, that has this format:
Now you can plot the log2 count values of two sample against each other. Which sample is most similar to sample 1? In what way do the other samples differ from sample 1?
It is somewhat challenging to compare all 6 samples with each other. You can use the ggpairs function from the GGally package to do this automatically.
What is your overall impression? Are all samples very similar or do you see differences? Can you give examples of large differences?
Although this plot is very informative, it is often more informative and common to compare the gene expression values between samples (or conditions) using an MA plot. In an MA plot you plot the difference in gene expression for each gene on the y-axis. The total expression level is plotted on the x-axis. It was originally developed for microarray data (http://en.wikipedia.org/wiki/MA_plot), but is also widely used for RNA-seq data. It will become clear when you start making such plots:
In this MA plot you will compare sample 1 with sample 3:
First calculate the difference in log expression and then plot the difference against total expression:
The advantage is that differences from zero in the y-axis directly indicate a difference in log2 fold change. Each dot indicates a gene, and dots on the left side in the graph have a low expression level, and genes on the right side in the graph have a high expression level.
Note that log2 expression values are plotted. Are the samples similar? What can you conclude from this plot?
To analyze all 6 samples, it is handy to compare them all with a common reference. This common reference is the based median expression level of each gene across all samples. Use the df.counts.f data.frame to first calculate the log2 of the median counts per gene and then calculate M and A per sample.
Make a plot comparing every sample to the common reference.
Note that log2 expression values are plotted. Do you see any outliers?
An important plot for the Quality Control is the Principal Component Analysis plot. This allows you to infer the effect of the different experimental variables. In R, a PCA can be performed using prcomp() function. We already prepared helper functions for performing PCA’s and generating associated plots, you can find them in /Users/joeri/SurfDrive/MMP-UVA/Teaching/1920_SBiP/experiments/20200129_sbip/functions/pca.R.
(Adapted from https://tbradley1013.github.io/2018/02/01/pca-in-a-tidy-verse-framework )
We can use the f_pca function to perform the PCA on our data and also extract the most important information. It returns you a nested data.frame.
## Observations: 1
## Variables: 4
## $ data <list> [<tbl_df[6 x 3328]>]
## $ pca <list> [<2.802365e+01, 1.417579e+01, 4.776911e+00, 4.476511e+0…
## $ pca_aug <list> [<tbl_df[6 x 3335]>]
## $ pca_loadings <list> [<tbl_df[19962 x 3]>]
First look at the amount of variance explained by each component using the f_pca_var_exp function.
## # A tibble: 6 x 4
## pc variance var_exp cum_var_exp
## <dbl> <dbl> <dbl> <dbl>
## 1 1 7.85e+ 2 7.52e- 1 0.752
## 2 2 2.01e+ 2 1.92e- 1 0.944
## 3 3 2.28e+ 1 2.18e- 2 0.966
## 4 4 2.00e+ 1 1.92e- 2 0.985
## 5 5 1.57e+ 1 1.50e- 2 1
## 6 6 2.62e-29 2.50e-32 1
Or in plotted form using f_pca_plot_var_exp
Which components are most relevant for explaining the variance in our dataset?
Let’s inspect where our samples are placed on the first three components.
The goal of this exercise is to perform a quality control of the gene expression count data, after normalization. In order to do this you will visualize the data using different types of plots. You will analyze your data using a methodology that is called DESeq, which is described in:
And in
Normalizing count data is one of the steps in the DESeq workflow. It is done to correct for differences in sequencing depth among samples and it is based on the median-of-ratios method. You can find the mathematical formulation in link # 1 provided above. In short, size factors are calculated per sample and give a measure of the median distance all genes of one sample to the geometric mean of each gene across all samples.
To calculate the size factors we will follow the next steps (adapted from this tutorial):
Create a pseudo-reference sample: for each gene, calculate the geometric mean of counts. Note: you can also use the log-average to calculate this but then you will have to modify the coming steps accordingly.
Calculate ratio of each sample to the pseudo-reference: for each sample, calculate the ratio of each gene to the the pseudo-reference. Note: if the pseudo-reference is equal to 0, make this ratio NA.
Calculate the size factor for each sample: this is done by calculating the median of all ratios per sample. Hint: set the argument na.rm = T in the median function.
Normalize the count data: for each sample, divide the counts of each gene by the calculated size factor.
Print the calculate size factors for each sample in a table:
| sequence_id | size_factor |
|---|---|
| S01 | 0.8449753 |
| S02 | 1.5748351 |
| S03 | 1.1641719 |
| S04 | 0.9563367 |
| S05 | 0.8856490 |
| S06 | 0.7758809 |
Remake the plots you created in sections 2.1 and 2.4 using the normalized data
Plot the total number of counts per sample:
How many reads do you (approximately) have from each sample? Are the samples different? Do you see outliers? Has the normalization worked?
Make a boxplot to visualize the distribution of normalized counts, for every sample:
Are the samples similar or different? Do you see outliers? Can you conclude that the normalization worked?
You can also make density plots to visualize the distribution of normalized counts, for every sample:
Are the samples similar or different? Do you see outliers? Can you conclude that the normalization worked?
Repeat the PCA done section 2.7 using the normalized data.
Make a plot showing the explained variance for each component
Make a plot showing where out samples are place on the fist 3 components
Does this look different from the PCA with pre-normalized data? In what way?
We will now investigate which genes are contributing most to the difference between the samples along the first PC axis (PC1).
Most down-regulated genes of PC 1:
## # A tibble: 10 x 2
## gene loading
## <chr> <dbl>
## 1 sll1552 -0.227
## 2 sll0654 -0.201
## 3 sll0720 -0.180
## 4 nucH -0.165
## 5 pstS -0.152
## 6 pstC -0.132
## 7 sphX -0.130
## 8 pstA -0.120
## 9 sll0594 -0.107
## 10 pstB -0.105
Most up-regulated genes of PC 1:
## # A tibble: 10 x 2
## gene loading
## <chr> <dbl>
## 1 slr1634 0.137
## 2 amiC 0.0995
## 3 sll1863 0.0794
## 4 slr0376 0.0750
## 5 glnH/glnP 0.0667
## 6 sll7064 0.0589
## 7 slr5037 0.0550
## 8 sll7065 0.0543
## 9 sll1862 0.0540
## 10 petJ 0.0534
Another useful tool is to plot the expression levels in a heatmap.
The goal of this exercise is to find differentially expressed genes.
First, recall the experimental design. What are our experimental groups?
| sample_id | sequence_id | phosphate_mM | purpose |
|---|---|---|---|
| 191226_PC_2_1 | S01 | 0.0115 | treatment |
| 191226_PC_2_3 | S02 | 0.0115 | treatment |
| 191226_PC_2_4 | S03 | 0.0115 | treatment |
| 200116_PC_3_1 | S04 | 0.2300 | control |
| 200116_PC_3_3 | S05 | 0.2300 | control |
| 200116_PC_3_4 | S06 | 0.2300 | control |
We are analyzing our data using the DESeq methodology. Instead of reinventing the wheel we will use the DESeq2 package. You can find more information about the theoretical background of the package in the papers provided above.
DESeq2 needs as inputs, at least, 3 objects:
Create a data.frame containing the (filtered) counts data and the design data. It should look like this:
## # A tibble: 19,962 x 12
## gene counts sequence_id counts_log2 sample_id multicultivator channel strain
## <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
## 1 AT103 20 S01 4.39 191226_P… 0 1 WT
## 2 ESI3 92 S01 6.54 191226_P… 0 1 WT
## 3 ETR1 142 S01 7.16 191226_P… 0 1 WT
## 4 Era 305 S01 8.26 191226_P… 0 1 WT
## 5 HAGH1 163 S01 7.36 191226_P… 0 1 WT
## 6 IAP75 3169 S01 11.6 191226_P… 0 1 WT
## 7 ICFG 800 S01 9.65 191226_P… 0 1 WT
## 8 Sac1 282 S01 8.14 191226_P… 0 1 WT
## 9 aat 37 S01 5.25 191226_P… 0 1 WT
## 10 abfB 136 S01 7.10 191226_P… 0 1 WT
## # … with 19,952 more rows, and 4 more variables: sample_time <dttm>,
## # phosphate_mM <dbl>, purpose <chr>, dataset <chr>
Create the object for countData. Transform the df.diff to wide format, set the genes as row names and convert to matrix. It should look like this:
## S04 S05 S06 S01 S02 S03
## AT103 26 28 18 20 34 30
## ESI3 84 98 77 92 200 123
## ETR1 120 141 98 142 242 209
## Era 380 244 260 305 651 375
## HAGH1 196 139 121 163 313 248
## IAP75 3181 3046 2703 3169 5090 4282
Create the object for colData. It should look like this:
## # A tibble: 6 x 2
## sequence_id purpose
## <fct> <fct>
## 1 S04 control
## 2 S05 control
## 3 S06 control
## 4 S01 treatment
## 5 S02 treatment
## 6 S03 treatment
Create the DESeq2 object and estimate size factors to apply counts normalization
The previous function call creates a DESeq2 object which, among many other things, contains the normalized counts and the estimated size factors. These factors should be equal to the ones we calculated previously.
Check if the previous statement is true.
| sequence_id | size_factor |
|---|---|
| S01 | 0.8449753 |
| S02 | 1.5748351 |
| S03 | 1.1641719 |
| S04 | 0.9563367 |
| S05 | 0.8856490 |
| S06 | 0.7758809 |
You can access the normalized counts from the DESeq object by calling the counts function with normalized = TRUE. This is how the normalized data looks like:
## S04 S05 S06 S01 S02 S03
## AT103 27.18708 31.61523 23.19944 23.66933 21.58956 25.76939
## ESI3 87.83518 110.65331 99.24203 108.87892 126.99742 105.65450
## ETR1 125.47882 159.20528 126.30804 168.05225 153.66688 179.52675
## Era 397.34961 275.50417 335.10296 360.95730 413.37662 322.11738
## HAGH1 204.94874 156.94705 155.95176 192.90505 198.75097 213.02696
## IAP75 3326.23447 3439.28563 3483.78197 3750.40554 3232.08446 3678.15098
In RNA-seq count data there is a dependency between the variance and the mean that is addressed in the statistical procedures that are used for differential gene expression analysis. This plot visualizes the (overdispersed) mean-variance dependency in your normalized data:
Computing mean and variance:
DESeq2 resolves this issue using regression and shrinkage:
In order to extract the results of the DESeq analysis, we need to call the results() function on the previously created DESeq object. The following code extracts into a data.frame the log fold change, the p-values and adjusted p-values for differential expression.
| gene | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
|---|---|---|---|---|---|---|
| AT103 | 25.50501 | -0.1916194 | 0.2986252 | -0.6418231 | 0.5209880 | 0.6338481 |
| ESI3 | 106.54356 | 0.1992797 | 0.1785884 | 1.1156377 | 0.2645773 | 0.3760105 |
| ETR1 | 152.03967 | 0.2734443 | 0.1596078 | 1.7129936 | 0.0867137 | 0.1525630 |
| Era | 350.73467 | 0.1199726 | 0.1430446 | 0.8386645 | 0.4016576 | 0.5194266 |
| HAGH1 | 187.08842 | 0.2142938 | 0.1507856 | 1.4210019 | 0.1553162 | 0.2458311 |
| IAP75 | 3484.99051 | 0.0554742 | 0.0897230 | 0.6182854 | 0.5363872 | 0.6484594 |
The results can be plotted in an MA plot:
…and you can plot the expression values of for instance the 10 most differentially expressed genes:
How many genes have an adj p-value < 0.01?
Answer: 1207.
Draw an histogram of the p-values and the adjusted p-values
Create a volcano plot.
Print a table with all the genes with adj p-value < 0.01 and log fold change > 2.
| gene | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
|---|---|---|---|---|---|---|
| slr1634 | 19777.15878 | -5.285370 | 0.1254559 | -42.100258 | 0 | 0 |
| amiC | 9403.41756 | -3.904825 | 0.0883051 | -44.201980 | 0 | 0 |
| slr1403 | 8251.46024 | 3.965997 | 0.0842125 | 47.062210 | 0 | 0 |
| pstB | 8098.19056 | 4.102974 | 0.0883609 | 46.396667 | 0 | 0 |
| pstA | 8596.19340 | 4.714814 | 0.0867471 | 54.275118 | 0 | 0 |
| sphX | 2605.54493 | 5.067331 | 0.1227929 | 40.961615 | 0 | 0 |
| pstC | 15500.87089 | 5.175105 | 0.0851707 | 60.692507 | 0 | 0 |
| pstS | 40614.10586 | 5.957979 | 0.0978243 | 60.855515 | 0 | 0 |
| nucH | 20802.56889 | 6.461794 | 0.0949813 | 67.846384 | 0 | 0 |
| sll0654 | 30245.13918 | 7.852965 | 0.1019336 | 76.532346 | 0 | 0 |
| sll0723 | 3622.55188 | 3.206341 | 0.0868611 | 36.884368 | 0 | 0 |
| sll0594 | 1442.15122 | 4.188438 | 0.1163392 | 35.779311 | 0 | 0 |
| sll0721 | 4270.71116 | 3.960210 | 0.1195234 | 33.088664 | 0 | 0 |
| ppk | 2413.24297 | 2.789296 | 0.0891573 | 31.260566 | 0 | 0 |
| sll0722 | 914.01220 | 3.615031 | 0.1224730 | 29.350124 | 0 | 0 |
| glnH/glnP | 1595.09671 | -2.598133 | 0.1024505 | -25.343683 | 0 | 0 |
| sll1291 | 3101.15950 | 2.053192 | 0.0858406 | 23.912676 | 0 | 0 |
| sll1552 | 1690.38691 | 8.125828 | 0.2621408 | 23.555387 | 0 | 0 |
| sll0685 | 763.61650 | 2.882493 | 0.1294801 | 22.196748 | 0 | 0 |
| slr0607 | 652.42191 | 3.039183 | 0.1394017 | 21.710851 | 0 | 0 |
| slr1593 | 1266.53182 | 2.510339 | 0.1176945 | 21.305953 | 0 | 0 |
| slr0376 | 12707.49112 | -2.874160 | 0.1368229 | -21.004237 | 0 | 0 |
| sll0595 | 295.19046 | 3.065807 | 0.1547884 | 19.584811 | 0 | 0 |
| sll1292 | 1127.69864 | 2.108668 | 0.1083922 | 19.438854 | 0 | 0 |
| hisI | 596.84033 | 2.225320 | 0.1145008 | 19.398010 | 0 | 0 |
| sll1293 | 881.49736 | 2.092038 | 0.1129283 | 18.506943 | 0 | 0 |
| slr0096 | 640.03443 | 2.000738 | 0.1129830 | 17.684886 | 0 | 0 |
| sll1483 | 582.45118 | 2.181477 | 0.1297207 | 16.787484 | 0 | 0 |
| slr6042 | 174.36264 | 2.353801 | 0.1797602 | 12.985387 | 0 | 0 |
| sll7065 | 264.30013 | -2.112276 | 0.1636629 | -12.876342 | 0 | 0 |
| sll7064 | 134.45549 | -2.272804 | 0.1815262 | -12.442298 | 0 | 0 |
| petJ | 255.62540 | -2.064304 | 0.1668064 | -12.348665 | 0 | 0 |
| sll0720 | 176.92265 | 5.795054 | 0.3339371 | 11.349673 | 0 | 0 |
| sll1863 | 1971.77673 | -2.709240 | 0.3116859 | -8.694742 | 0 | 0 |
| slr0364 | 754.24096 | 2.176862 | 0.2574683 | 8.446758 | 0 | 0 |
| sll5123 | 45.90136 | 2.323667 | 0.2846438 | 7.893656 | 0 | 0 |
DEseq data was exported to ./data/processed/transcriptomics/20200305_transcriptomics_deseq.csv